The Salifort Motors Project¶
Introduction¶
Description and deliverables¶
- This project is designed to showcase analytical skills in exploring a dataset and building predictive models that can provide insights to the Human Resources (HR) department of Salifort Motors, a large company.
The Purpose: Salifort Motors is seeking a method to use employee data to gauge what makes an employee leave the company.
The Goal: To design a model that predicts whether an employee will leave the company based on their job title, department, number of projects, average monthly hours, and any other relevant data points.
Desired Outcome: To develop a good model that will help the company increase retention and job satisfaction for current employees, and save money and time training new employees.
PACE: Plan¶
Understanding the business scenario and problem¶
The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They approached me as a data analytics professional and asked me to provide data-driven suggestions based on my understanding of the data. They have the following question:
- What’s likely to make an employee leave the company?
Note: If I can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.
Step 1. Imports¶
# Import packages
# For data manipulation
import numpy as np
import pandas as pd
# For data visualization
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)
# For data modeling
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
# For saving models
import pickle
import os
Loading the dataset¶
# Loading dataset into a dataframe
df0 = pd.read_csv("HR_capstone_dataset.csv")
# Displaying first few rows of the dataframe
df0.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Step 2. Data Exploration¶
- Understanding the variables
- Cleaning the dataset (missing data, redundant data, outliers)
# Gathering basic information about the data
df0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64
 3   average_montly_hours   14999 non-null  int64
 4   time_spend_company     14999 non-null  int64
 5   Work_accident          14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotion_last_5years  14999 non-null  int64
 8   Department             14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
# Gathering descriptive statistics about the data
df0.describe()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
Renaming columns¶
- As a data cleaning step, I'll rename the columns as needed. I'll standardize the column names so that they are all in
snake_case, correct any columns that are misspelled, and make column names more concise as needed.
# Displaying all column names
df0.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
'promotion_last_5years', 'Department', 'salary'],
dtype='object')
# Renaming columns as needed
df0 = df0.rename(columns={'Work_accident': 'work_accident',
'average_montly_hours': 'average_monthly_hours',
'time_spend_company': 'tenure',
'Department': 'department'})
# Displaying all column names after the update
df0.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
'average_monthly_hours', 'tenure', 'work_accident', 'left',
'promotion_last_5years', 'department', 'salary'],
dtype='object')
Checking for missing values¶
df0.isna().sum()
satisfaction_level       0
last_evaluation          0
number_project           0
average_monthly_hours    0
tenure                   0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64
Observation: There are no missing values within this dataset.
Checking for duplicates¶
df0.duplicated().sum()
3008
Observation: 3,008 rows contain duplicates. That is 20% of the dataset.
# Inspecting some rows containing duplicates
df0[df0.duplicated()].head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.46 | 0.57 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.41 | 0.46 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.37 | 0.51 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.41 | 0.52 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
Observations:
- The above output shows the first five occurrences of rows that are duplicated farther down in the dataframe. How likely is it that these are legitimate entries? In other words, how plausible is it that two employees self-reported the exact same response for every column?
- I could estimate the likelihood of a legitimate full-row match by multiplying the empirical probabilities of each column's value (an independence assumption), but this doesn't seem necessary. With several continuous variables across 10 columns, it seems very unlikely that these observations are legitimate. I'll proceed by dropping them.
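The back-of-the-envelope reasoning above can be sketched in code. The mini-frame and helper below are hypothetical (the sketch runs stand-alone rather than on `df1`), and multiplying per-column frequencies assumes the columns are independent:

```python
import pandas as pd

# Hypothetical mini-frame standing in for a few columns of the HR data
toy = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.38],
    'number_project':     [2, 5, 7, 5, 2],
    'salary':             ['low', 'medium', 'medium', 'low', 'low'],
})

def row_match_probability(df, row):
    """Product of empirical per-column frequencies for one row's values
    (assumes independence between columns)."""
    p = 1.0
    for col in df.columns:
        p *= (df[col] == row[col]).mean()
    return p

prob = row_match_probability(toy, toy.iloc[0])
print(prob)
```

Even with only three columns and five rows the product is already small; with ten columns, several of them continuous, an exact full-row match by chance is vanishingly unlikely.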
# Dropping duplicates and saving resulting dataframe in a new variable
df1 = df0.drop_duplicates(keep='first')
# Displaying first few rows of the new dataframe
df1.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Checking for outliers¶
# Creating a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(6,6))
plt.title('Boxplot to detect outliers for tenure', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=df1['tenure'])
plt.show()
Observations:
- The boxplot shows that there are outliers in the `tenure` variable.
- I'll investigate how many rows in the data contain outliers in the `tenure` column.
# Determining the number of rows containing outliers
# Computing the 25th percentile value in `tenure`
percentile25 = df1['tenure'].quantile(0.25)
# Computing the 75th percentile value in `tenure`
percentile75 = df1['tenure'].quantile(0.75)
# Computing the interquartile range in `tenure`
iqr = percentile75 - percentile25
# Defining the upper limit and lower limit for non-outlier values in `tenure`
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)
# Identifying subset of data containing outliers in `tenure`
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
# Counting how many rows in the data contain outliers in `tenure`
print("Number of rows in the dataset containing outliers in `tenure`:", len(outliers))
Lower limit: 1.5
Upper limit: 5.5
Number of rows in the dataset containing outliers in `tenure`: 824
Observation:
- Certain types of models are more sensitive to outliers than others. When I get to the stage of building the model, I'll consider whether to remove these outliers based on the type of model I decide to use.
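Since the same IQR check may be useful again before fitting an outlier-sensitive model, here's a small reusable sketch of the rule applied above. The helper and sample values are illustrative, not taken from the dataset:

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical tenure-like values for demonstration
tenure = pd.Series([2, 3, 3, 3, 4, 4, 5, 6, 8, 10])
low, high = iqr_limits(tenure)
n_outliers = ((tenure < low) | (tenure > high)).sum()
print(low, high, n_outliers)
```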
Step 3. Data Exploration (EDA continued)¶
To begin I'll need to understand how many employees left and what percentage of all employees this represents.
# Getting number of people who left vs. stayed
print(df1['left'].value_counts())
print()
# Getting percentages of people who left vs. stayed
print(df1['left'].value_counts(normalize=True))
left
0    10000
1     1991
Name: count, dtype: int64

left
0    0.833959
1    0.166041
Name: proportion, dtype: float64
Data visualizations¶
Now, I'll examine variables that I am interested in, and create plots to visualize relationships between variables in the data.
I'll start by creating a stacked boxplot showing `average_monthly_hours` distributions for `number_project`, comparing the distributions of employees who stayed versus those who left.

Box plots are very useful for visualizing distributions within data, but they can be deceiving without the context of the sample sizes they represent. So I'll also plot a stacked histogram to visualize the distribution of `number_project` for those who stayed and those who left.
# Setting figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))
# Creating boxplot showing `average_monthly_hours` distributions for `number_project`, comparing employees who stayed versus those who left
sns.boxplot(data=df1, x='average_monthly_hours', y='number_project', hue='left', orient="h", ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Monthly hours by number of projects', fontsize='14')
# Creating histogram showing distribution of `number_project`, comparing employees who stayed versus those who left
tenure_stay = df1[df1['left']==0]['number_project']
tenure_left = df1[df1['left']==1]['number_project']
sns.histplot(data=df1, x='number_project', hue='left', multiple='dodge', shrink=2, ax=ax[1])
ax[1].set_title('Number of projects histogram', fontsize='14')
# Displaying the plots
plt.show()
Observations:
It might be natural that people who work on more projects would also work longer hours. This appears to be the case here, with the mean hours of each group (stayed and left) increasing with number of projects worked. However, a few things stand out from this plot.
There are two groups of employees who left the company: (A) those who worked considerably less than their peers with the same number of projects, and (B) those who worked much more. Of those in group A, it's possible that they were fired. It's also possible that this group includes employees who had already given notice and were assigned fewer hours because they were on their way out the door. For those in group B, it's reasonable to infer that they probably quit, and they were likely among the largest contributors to their projects.
Everyone with seven projects left the company, and the interquartile ranges of this group and of those who left with six projects were ~255–295 hours/month, much more than for any other group.
The optimal number of projects for employees to work on seems to be 3–4. The ratio of left/stayed is very small for these cohorts.
If I assume a work week of 40 hours and two weeks of vacation per year, then the average number of working hours per month of employees working Monday–Friday
= 50 weeks * 40 hours per week / 12 months = 166.67 hours per month. This means that, aside from the employees who worked on two projects, every group—even those who didn't leave the company—worked considerably more hours than this. It seems that employees here are overworked.
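That benchmark is simple enough to verify inline: 52 weeks minus 2 weeks of vacation, at 40 hours per week, averaged over 12 months:

```python
# (52 - 2) working weeks * 40 hours/week, spread over 12 months
nominal_monthly_hours = 50 * 40 / 12
print(round(nominal_monthly_hours, 2))  # 166.67
```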
As the next step, I'll confirm that all employees with seven projects left.
# Getting value counts of stayed/left for employees with 7 projects
df1[df1['number_project']==7]['left'].value_counts()
left
1    145
Name: count, dtype: int64
Observation:
- This confirms that all employees with `7` projects did leave.
Next, I'll examine the average monthly hours versus the satisfaction levels.
# Creating a scatterplot of `average_monthly_hours` versus `satisfaction_level`, comparing employees who stayed versus those who left
plt.figure(figsize=(16, 9))
custom_palette = {0: "#1f77b4", 1: "#ff7f0e"}
sns.scatterplot(data=df1, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4, palette=custom_palette)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
# Custom legend handles
red_dashed_line = mlines.Line2D([], [], color='#ff6361', linestyle='--', label="166.67 hrs./mo.")
blue_dot = mlines.Line2D([], [], color="#1f77b4", marker='o', linestyle='None', markersize=6, label="Stayed")
orange_dot = mlines.Line2D([], [], color="#ff7f0e", marker='o', linestyle='None', markersize=6, label="Left")
plt.legend(handles=[red_dashed_line, orange_dot, blue_dot])
plt.title('Monthly hours by satisfaction level', fontsize='14');
Observations:
- The scatterplot above shows that there was a sizeable group of employees who worked 240–315 hours per month. 315 hours per month is over 75 hours per week for a whole year. It's likely this is related to their satisfaction levels being close to zero.
- The plot also shows another group of people who left, those who had more normal working hours. Even so, their satisfaction was only around 0.4. It's difficult to speculate about why they might have left. It's possible they felt pressured to work more, considering so many of their peers worked more. And that pressure could have lowered their satisfaction levels.
- Finally, there is a group who worked 210–280 hours per month, and they had satisfaction levels ranging 0.7–0.9.
- Note the strange shape of the distributions here. This is indicative of data manipulation or synthetic data.
Next, it will be interesting to visualize satisfaction levels by tenure.
# Setting figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))
# Creating a boxplot showing distributions of `satisfaction_level` by tenure, comparing employees who stayed versus those who left
sns.boxplot(data=df1, x='satisfaction_level', y='tenure', hue='left', orient="h", ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Satisfaction by tenure', fontsize='14')
# Creating histogram showing distribution of `tenure`, comparing employees who stayed versus those who left
tenure_stay = df1[df1['left']==0]['tenure']
tenure_left = df1[df1['left']==1]['tenure']
sns.histplot(data=df1, x='tenure', hue='left', multiple='dodge', shrink=5, ax=ax[1])
ax[1].set_title('Tenure histogram', fontsize='14')
plt.show();
Observations:
- Employees who left fall into two general categories: dissatisfied employees with shorter tenures and very satisfied employees with medium-length tenures.
- Four-year employees who left seem to have an unusually low satisfaction level. It's worth me investigating changes to company policy that might have affected people specifically at the four-year mark, if possible.
- The longest-tenured employees didn't leave. Their satisfaction levels aligned with those of newer employees who stayed.
- The histogram shows that there are relatively few longer-tenured employees. It's possible that they're the higher-ranking, higher-paid employees.
As the next step in analyzing this dataset, I'll calculate the mean and median satisfaction scores of employees who left and those who didn't.
# Calculating the mean and median satisfaction scores of employees who left and those who stayed
df1.groupby(['left'])['satisfaction_level'].agg(["mean", "median"])
| | mean | median |
|---|---|---|
| left | | |
| 0 | 0.667365 | 0.69 |
| 1 | 0.440271 | 0.41 |
Observations:
- As expected, the mean and median satisfaction scores of employees who left are lower than those of employees who stayed. Interestingly, among employees who stayed, the mean satisfaction score appears to be slightly below the median score. This indicates that satisfaction levels among those who stayed might be skewed to the left.
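The mean-below-median reading can be checked with a skewness statistic (negative skew = left-skewed). A sketch on hypothetical values, since it should run stand-alone rather than on `df1`:

```python
import pandas as pd

# Hypothetical satisfaction scores: group 0 has a low-end tail, group 1 a high-end tail
toy = pd.DataFrame({
    'left':               [0, 0, 0, 0, 0, 1, 1, 1],
    'satisfaction_level': [0.2, 0.6, 0.7, 0.7, 0.8, 0.3, 0.4, 0.9],
})
skew_by_group = toy.groupby('left')['satisfaction_level'].skew()
print(skew_by_group)  # group 0 negative (left-skewed), group 1 positive
```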
Next, I'll examine salary levels for different tenures.
# Setting figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))
# Defining short-tenured employees
tenure_short = df1[df1['tenure'] < 7]
# Defining long-tenured employees
tenure_long = df1[df1['tenure'] > 6]
# Plotting short-tenured histogram
sns.histplot(data=tenure_short, x='tenure', hue='salary', discrete=1,
hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.5, ax=ax[0])
ax[0].set_title('Salary histogram by tenure: short-tenured people', fontsize='14')
# Plotting long-tenured histogram
sns.histplot(data=tenure_long, x='tenure', hue='salary', discrete=1,
hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.4, ax=ax[1])
ax[1].set_title('Salary histogram by tenure: long-tenured people', fontsize='14');
Observation:
- The plots above show that long-tenured employees were not disproportionately composed of higher-paid employees.
Next, I'll explore whether there's a correlation between working long hours and receiving high evaluation scores. I'll create a scatterplot of average_monthly_hours versus last_evaluation.
# Creating a scatterplot of `average_monthly_hours` versus `last_evaluation`
plt.figure(figsize=(16, 9))
custom_palette = {0: "#1f77b4", 1: "#ff7f0e"}
sns.scatterplot(data=df1, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4, palette=custom_palette)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
# Custom legend handles
red_dashed_line = mlines.Line2D([], [], color='#ff6361', linestyle='--', label="166.67 hrs./mo.")
blue_dot = mlines.Line2D([], [], color="#1f77b4", marker='o', linestyle='None', markersize=6, label="Stayed")
orange_dot = mlines.Line2D([], [], color="#ff7f0e", marker='o', linestyle='None', markersize=6, label="Left")
plt.legend(handles=[red_dashed_line, orange_dot, blue_dot])
plt.title('Monthly hours by last evaluation score', fontsize='14');
Observations:
- The scatterplot indicates two groups of employees who left: overworked employees who performed very well and employees who worked slightly under the nominal monthly average of 166.67 hours with lower evaluation scores.
- There seems to be a correlation between hours worked and evaluation score.
- There isn't a high percentage of employees in the upper-left quadrant of this plot, but working long hours doesn't guarantee a good evaluation score.
- Most of the employees in this company work well over 167 hours per month.
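The visual correlation between hours and evaluation scores can be quantified with Pearson's r; the pairs below are hypothetical stand-ins for `average_monthly_hours` and `last_evaluation`:

```python
import pandas as pd

# Hypothetical hours/evaluation pairs with a roughly linear relationship
toy = pd.DataFrame({
    'average_monthly_hours': [140, 160, 200, 240, 280, 300],
    'last_evaluation':       [0.45, 0.55, 0.60, 0.75, 0.85, 0.90],
})
r = toy['average_monthly_hours'].corr(toy['last_evaluation'])
print(round(r, 3))
```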
Next, I'll examine whether employees who worked very long hours were promoted in the last five years.
# Creating a plot to examine the relationship between `average_monthly_hours` and `promotion_last_5years`
plt.figure(figsize=(16, 3))
custom_palette = {0: "#1f77b4", 1: "#ff7f0e"}
sns.scatterplot(data=df1, x='average_monthly_hours', y='promotion_last_5years', hue='left', alpha=0.4, palette=custom_palette)
plt.axvline(x=166.67, color='#ff6361', ls='--')
# Custom legend handles
red_dashed_line = mlines.Line2D([], [], color='#ff6361', linestyle='--', label="166.67 hrs./mo.")
blue_dot = mlines.Line2D([], [], color="#1f77b4", marker='o', linestyle='None', markersize=6, label="Stayed")
orange_dot = mlines.Line2D([], [], color="#ff7f0e", marker='o', linestyle='None', markersize=6, label="Left")
plt.legend(handles=[red_dashed_line, orange_dot, blue_dot])
plt.title('Monthly hours by promotion last 5 years', fontsize='14');
Observations:
- Very few employees who were promoted in the last five years left.
- Very few employees who worked the most hours were promoted.
- All of the employees who left were working the longest hours.
Next, I'll inspect how the employees who left are distributed across departments.
# Displaying counts for each department
df1["department"].value_counts()
department
sales          3239
technical      2244
support        1821
IT              976
RandD           694
product_mng     686
marketing       673
accounting      621
hr              601
management      436
Name: count, dtype: int64
# Creating a stacked histogram to compare department distribution of employees who left to that of employees who didn't
plt.figure(figsize=(11,8))
sns.histplot(data=df1, x='department', hue='left', discrete=1,
hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.xticks(rotation='45')
plt.title('Counts of stayed/left by department', fontsize=14);
Observation:
- There doesn't seem to be any department that differs significantly in its proportion of employees who left to those who stayed.
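That by-department impression can be checked numerically with per-department leave rates. A stand-alone sketch on a hypothetical mini-frame (on the real data this would be `df1.groupby('department')['left'].mean()`):

```python
import pandas as pd

# Hypothetical department/left pairs
toy = pd.DataFrame({
    'department': ['sales', 'sales', 'sales', 'IT', 'IT', 'hr'],
    'left':       [1, 0, 0, 0, 1, 0],
})
# The mean of a 0/1 column per group is the proportion who left
leave_rate = toy.groupby('department')['left'].mean().sort_values(ascending=False)
print(leave_rate)
```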
# Plotting a correlation heatmap
plt.figure(figsize=(16, 9))
# Selecting only numeric columns
df_numeric = df0.select_dtypes(include=['number'])
# Generating the heatmap
heatmap = sns.heatmap(df_numeric.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12)
plt.show()
Observations:
- The correlation heatmap confirms that the number of projects, monthly hours, and evaluation scores all have some positive correlation with each other, and whether an employee leaves is negatively correlated with their satisfaction level.
Insights¶
- It appears that employees are leaving the company as a result of poor management. Leaving is tied to longer working hours, many projects, and generally lower satisfaction levels. It can be ungratifying to work long hours and not receive promotions or good evaluation scores. There's a sizeable group of employees at this company who are probably burned out. It also appears that if an employee has spent more than six years at the company, they tend not to leave.
PACE: Construct¶
Within the construct phase of the PACE framework I will:
- Determine which models are most appropriate
- Construct the model
- Confirm model assumptions
- Evaluate model results to determine how well the model fits the data
Model assumptions¶
Logistic Regression model assumptions
- Outcome variable is categorical
- Observations are independent of each other
- No severe multicollinearity among X variables
- No extreme outliers
- Linear relationship between each X variable and the logit of the outcome variable
- Sufficiently large sample size
Step 4. Model Building¶
Within the next steps I'll complete the following:
- Fit a model that predicts the outcome variable using two or more independent variables
- Check model assumptions
- Evaluate the model
Identifying the type of prediction task¶
- The goal is to predict whether an employee leaves the company, which is a categorical outcome variable. So this task involves classification. More specifically, it involves binary classification, since the outcome variable `left` can be either `1` (indicating the employee left) or `0` (indicating the employee did not leave).
Identifying the types of models most appropriate for this task¶
- Since the variable I want to predict (whether an employee leaves the company) is categorical, I could either build a Logistic Regression model, or a Tree-based Machine Learning model.
To proceed, I'll implement both and determine how they compare.
First: Building a Logistic Regression Model¶
- Binomial logistic regression suits the task because it involves binary classification.
- Before splitting the data, I'll encode the non-numeric variables. Within this dataset there are two: `department` and `salary`. `department` is a categorical variable, which means I can dummy-encode it for modeling. `salary` is categorical too, but it's ordinal: there's a hierarchy to the categories, so it's better not to dummy-encode this column, but rather to convert the levels to numbers, 0–2.
# Copying the dataframe
df_enc = df1.copy()
# Encoding the `salary` column as an ordinal numeric category
df_enc['salary'] = (
df_enc['salary'].astype('category')
.cat.set_categories(['low', 'medium', 'high'])
.cat.codes
)
# Dummy encoding the `department` column
df_enc = pd.get_dummies(df_enc, drop_first=False)
# Displaying the new dataframe
df_enc.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
Next, I'll create a heatmap to visualize how correlated variables are. I'll consider which variables I am interested in examining correlations between.
# Creating a heatmap to visualize how correlated variables are
plt.figure(figsize=(8, 6))
sns.heatmap(df_enc[['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'tenure']]
.corr(), annot=True, cmap="crest")
plt.title('Heatmap of the dataset')
plt.show()
Next, I'll create a stacked bar plot to visualize the number of employees across departments, comparing those who left with those who did not.
# Creating a stacked bar plot to visualize the number of employees across departments, comparing those who left with those who didn't
# In the legend, 0 (magenta) represents employees who did not leave, 1 (red) represents employees who left
pd.crosstab(df1['department'], df1['left']).plot(kind='bar', color='mr')
plt.title('Counts of employees who left versus stayed across departments')
plt.ylabel('Employee count')
plt.xlabel('Department')
plt.show()
Observation:
- Since logistic regression is quite sensitive to outliers, at this stage I'll remove the outliers in the `tenure` column that were identified earlier.
# Selecting rows without outliers in `tenure` and saving the resulting dataframe in a new variable
df_logreg = df_enc[(df_enc['tenure'] >= lower_limit) & (df_enc['tenure'] <= upper_limit)]
# Displaying first few rows of new dataframe
df_logreg.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
Next, I'll isolate the outcome variable, which is the variable I want the model to predict.
# Isolating the outcome variable
y = df_logreg['left']
# Displaying first few rows of the outcome variable
y.head()
0    1
2    1
3    1
4    1
5    1
Name: left, dtype: int64
Next, I'll select the features I want to use in the model. I'll consider which variables will help to predict the outcome variable, left.
# Selecting the features I want to use in the model by dropping the outcome variable
X = df_logreg.drop('left', axis=1)
# Displaying the first few rows of the selected features
X.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
Now, I'll split the data into a training set and testing set. I'll also stratify based on the values in y, since the classes are unbalanced.
# Splitting the data into a training set and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
Constructing a logistic regression model and fitting it to the training dataset.
# Constructing a logistic regression model and fitting it to the training dataset
log_clf = LogisticRegression(random_state=42, max_iter=500).fit(X_train, y_train)
Testing the logistic regression model by using the model to make predictions on the test set.
# Using the logistic regression model to get predictions on the test set
y_pred = log_clf.predict(X_test)
Creating a confusion matrix to visualize the results of the logistic regression model.
# Computing values for confusion matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)
# Creating a display of the confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix=log_cm,
display_labels=log_clf.classes_)
# Plotting the confusion matrix
log_disp.plot(values_format='')
# Displaying plot
plt.show()
Observations:
- Upper-left quadrant, true negatives (TN): the number of people who did not leave that the model accurately predicted did not leave.
- Bottom-right quadrant, true positives (TP): the number of people who left that the model accurately predicted as leaving.
- Upper-right quadrant, false positives (FP): the number of people who did not leave that the model inaccurately predicted as leaving.
- Bottom-left quadrant, false negatives (FN): the number of people who left that the model inaccurately predicted did not leave.
A perfect model would yield all true negatives and true positives, and no false negatives or false positives.
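The quadrant layout above maps directly onto the array returned by scikit-learn's `confusion_matrix`, which for binary labels is laid out as `[[TN, FP], [FN, TP]]`. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not the HR data)
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

# ravel() flattens [[TN, FP], [FN, TP]] into the four quadrant counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```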
Next, I'll check the class balance in the data; in other words, the value counts of the `left` column. Since this is a binary classification task, the class balance informs the way I interpret accuracy metrics.
df_logreg['left'].value_counts(normalize=True)
left
0    0.831468
1    0.168532
Name: proportion, dtype: float64
Observations:
- There is an approximately 83%/17% split, so the data is not perfectly balanced, but it is not too imbalanced either. If it were more severely imbalanced, I might want to resample the data to make it more balanced. In this case, I can use the data without modifying the class balance and continue evaluating the model.
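For reference, one common option if the split were more severe (not needed here) is upsampling the minority class with `sklearn.utils.resample`. A toy sketch on made-up data:

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with an 8/2 class split (illustration only)
toy = pd.DataFrame({'feature': range(10), 'left': [0]*8 + [1]*2})
majority = toy[toy['left'] == 0]
minority = toy[toy['left'] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['left'].value_counts())  # both classes now have 8 rows
```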
Next, I'll create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the logistic regression model.
# Creating a classification report for the logistic regression model
target_names = ['Predicted would not leave', 'Predicted would leave']
print(classification_report(y_test, y_pred, target_names=target_names))
precision recall f1-score support
Predicted would not leave 0.86 0.93 0.90 2321
Predicted would leave 0.44 0.26 0.33 471
accuracy 0.82 2792
macro avg 0.65 0.60 0.61 2792
weighted avg 0.79 0.82 0.80 2792
Observations:
- The classification report above shows that the logistic regression model achieved a precision of 79%, recall of 82%, f1-score of 80% (all weighted averages), and accuracy of 82%. However, if it's most important to predict employees who leave, then the scores for that class are significantly lower.
Second: Building a Tree-Based Model¶
To start, I'll isolate the outcome variable
# Isolating the outcome variable
y = df_enc['left']
# Displaying the first few rows of `y`
y.head()
0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64
Selecting the features
# Selecting the features
X = df_enc.drop('left', axis=1)
# Displaying the first few rows of `X`
X.head()
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
Splitting the data into training, validating, and testing sets.
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
Decision tree - Round 1
- I'll construct a decision tree model and set up cross-validated grid-search to exhaustively search for the best model parameters.
# Instantiating the model
tree = DecisionTreeClassifier(random_state=0)
# Assigning a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assigning a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
# Instantiating GridSearch
tree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
Fitting the decision tree model to the training data.
%%time
tree1.fit(X_train, y_train)
Wall time: 3.03 s
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
param_grid={'max_depth': [4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]},
refit='roc_auc',
scoring={'recall', 'precision', 'roc_auc', 'f1', 'accuracy'})
Identifying the optimal values for the decision tree parameters.
# Checking the best parameters
tree1.best_params_
{'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 2}
Identifying the best AUC score achieved by the decision tree model on the training set.
# Checking the best AUC score on CV
tree1.best_score_
0.969819392792457
This is a strong AUC score, which shows that this model can predict employees who will leave very well.
Next, I'll write a function that will help me extract all the scores from the grid search.
def make_results(model_name:str, model_object, metric:str):
'''
Arguments:
model_name (string): what I want the model to be called in the output table
model_object: a fit GridSearchCV object
metric (string): precision, recall, f1, accuracy, or auc
Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
for the model with the best mean 'metric' score across all validation folds.
'''
# Creating a dictionary that maps input metric to actual metric name in GridSearchCV
metric_dict = {'auc': 'mean_test_roc_auc',
'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy'
}
# Getting all the results from the CV and putting them in a df
cv_results = pd.DataFrame(model_object.cv_results_)
# Isolating the row of the df with the max(metric) score
best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Extracting Accuracy, precision, recall, and f1 score from that row
auc = best_estimator_results.mean_test_roc_auc
f1 = best_estimator_results.mean_test_f1
recall = best_estimator_results.mean_test_recall
precision = best_estimator_results.mean_test_precision
accuracy = best_estimator_results.mean_test_accuracy
# Creating a table of results
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
'auc': [auc]
})
return table
Next, I'll use the function I just defined to get all the scores from grid search.
# Getting all CV scores
tree1_cv_results = make_results('decision tree cv', tree1, 'auc')
tree1_cv_results
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | decision tree cv | 0.914552 | 0.916949 | 0.915707 | 0.971978 | 0.969819 |
Observations:
All of these scores from the decision tree model are strong indicators of good model performance.
Decision trees can be vulnerable to overfitting, and random forests avoid overfitting by incorporating multiple trees to make predictions. I'll construct a random forest model next.
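As a quick illustration of that intuition (on synthetic data, not the HR dataset): an unpruned decision tree will typically memorize its training set, while a forest of bagged trees tends to generalize better on noisy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberate label noise (flip_y) to invite overfitting
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

tree_demo = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)    # unpruned tree
forest_demo = RandomForestClassifier(random_state=0).fit(Xtr, ytr)  # averages many trees

print('tree   train/test:', tree_demo.score(Xtr, ytr), tree_demo.score(Xte, yte))
print('forest train/test:', forest_demo.score(Xtr, ytr), forest_demo.score(Xte, yte))
```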
Random Forest - Round 1
- I'll construct a random forest model and set up cross-validated grid-search to exhaustively search for the best model parameters.
# Instantiating the model
rf = RandomForestClassifier(random_state=0)
# Assigning a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assigning a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
# Instantiating GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc')
Fitting the random forest model to the training data.
%%time
rf1.fit(X_train, y_train) # --> Wall time: 10min
Wall time: 13min 31s
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0),
param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3, 4],
'n_estimators': [300, 500]},
refit='roc_auc',
scoring={'recall', 'precision', 'roc_auc', 'f1', 'accuracy'})
Specifying the path to where I want to save this model
# Defining a path to the folder where I want to save the model
path = 'C:/Users/cneez/Python Code'
Defining functions to pickle the model and read in the model.
def write_pickle(path, model_object, save_as: str):
'''
In:
path: Folder path where I want to save the pickle
model_object: The object/model I want to pickle
save_as: Filename to save the model as
Out: Saves the model as a pickle file in the specified folder
'''
full_path = os.path.join(path, save_as + '.pickle') # Correctly join paths
with open(full_path, 'wb') as to_write:
pickle.dump(model_object, to_write)
print(f"Pickle file saved at: {full_path}")
def read_pickle(path, saved_model_name: str):
'''
In:
path: Folder path where the pickle file is located
saved_model_name: Filename of the pickled model I want to read
Out:
model: The unpickled model object
'''
full_path = os.path.join(path, saved_model_name + '.pickle') # Correctly join paths
with open(full_path, 'rb') as to_read:
model = pickle.load(to_read)
print(f"Pickle file loaded from: {full_path}")
return model
I'll use the functions defined above to save the model in a pickle file and then read it in.
# Writing pickle
write_pickle(path, rf1, 'hr_rf1')
Pickle file saved at: C:/Users/cneez/Python Code\hr_rf1.pickle
# Reading pickle
rf1 = read_pickle(path, 'hr_rf1')
Pickle file loaded from: C:/Users/cneez/Python Code\hr_rf1.pickle
Identifying the best AUC score achieved by the random forest model on the training set.
# Checking the best AUC score on CV
rf1.best_score_
0.9804250949807172
Identifying the optimal values for the parameters of the random forest model.
# Checking the best params
rf1.best_params_
{'max_depth': 5,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 1,
'min_samples_split': 4,
'n_estimators': 500}
Collecting the evaluation scores on the training set for the decision tree and random forest models.
# Getting all CV scores
rf1_cv_results = make_results('random forest cv', rf1, 'auc')
print(tree1_cv_results)
print(rf1_cv_results)
model precision recall F1 accuracy auc
0 decision tree cv 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy auc
0 random forest cv 0.950023 0.915614 0.932467 0.977983 0.980425
Observations:
- The evaluation scores of the random forest model are better than those of the decision tree model, with the exception of recall (the recall score of the random forest model is approximately 0.001 lower, which is a negligible amount). This indicates that the random forest model mostly outperforms the decision tree model.
Next, I'll evaluate the final model on the test set.
First, I'll define a function that gets all the scores from a model's predictions.
def get_scores(model_name:str, model, X_test_data, y_test_data):
'''
Generate a table of test scores.
In:
model_name (string): How I want the model to be named in the output table
model: A fit GridSearchCV object
X_test_data: numpy array of X_test data
y_test_data: numpy array of y_test data
Out: pandas df of precision, recall, f1, accuracy, and AUC scores for the model
'''
preds = model.best_estimator_.predict(X_test_data)
auc = roc_auc_score(y_test_data, preds)
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'f1': [f1],
'accuracy': [accuracy],
'AUC': [auc]
})
return table
Now, I'll use the best performing model to predict on the test set.
# Getting predictions on test data
rf1_test_scores = get_scores('random forest1 test', rf1, X_test, y_test)
rf1_test_scores
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | random forest1 test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
Observations:
- The test scores are very similar to the validation scores, which is good. This appears to be a strong model. Since this test set was only used for this model, I can be more confident that the model's performance on this data is representative of how it will perform on new, unseen data.
Feature Engineering¶
- I am a little skeptical of the high evaluation scores. There is a chance that there is some data leakage occurring. Data leakage is when you use data to train your model that should not be used during training, either because it appears in the test data or because it's not data that you'd expect to have when the model is actually deployed. Training a model with leaked data can give an unrealistic score that is not replicated in production.
- In this case, it's likely that the company won't have satisfaction levels reported for all of its employees. It's also possible that the `average_monthly_hours` column is a source of some data leakage: if employees have already decided to quit, or have already been identified by management as people to be fired, they may be working fewer hours.
- The first round of decision tree and random forest models included all variables as features. This next round will incorporate feature engineering to build improved models.
- I'll proceed by dropping `satisfaction_level` and creating a new feature that roughly captures whether an employee is overworked. I'll call this new feature `overworked`; it will be a binary variable.
# Dropping `satisfaction_level` and saving the resulting dataframe in new variable
df2 = df_enc.drop('satisfaction_level', axis=1)
# Displaying first few rows of new dataframe
df2.head()
| last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
# Creating the `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['average_monthly_hours']
# Inspecting max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
print('Min hours:', df2['overworked'].min())
Max hours: 310
Min hours: 96
Observations:
166.67 is approximately the average number of monthly hours for someone who works 50 weeks per year, 5 days per week, 8 hours per day.
One could define being overworked as working more than 175 hours per month on average.
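The 166.67 figure comes from simple arithmetic, which can be sanity-checked directly:

```python
# 50 working weeks/year, 5 days/week, 8 hours/day, spread over 12 months
weeks_per_year = 50
days_per_week = 5
hours_per_day = 8

avg_monthly_hours = weeks_per_year * days_per_week * hours_per_day / 12
print(round(avg_monthly_hours, 2))  # 166.67
```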
To make the overworked column binary, I'll reassign the column using a boolean mask.
- `df2['overworked'] > 175` creates a series of booleans, consisting of `True` for every value > 175 and `False` for every value ≤ 175.
- `astype(int)` converts all `True` to `1` and all `False` to `0`.
# Defining `overworked` as working > 175 hrs/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)
# Displaying first few rows of new column
df2['overworked'].head()
0    0
1    1
2    1
3    1
4    0
Name: overworked, dtype: int32
Dropping the average_monthly_hours column.
# Dropping the `average_monthly_hours` column
df2 = df2.drop('average_monthly_hours', axis=1)
# Displaying first few rows of resulting dataframe
df2.head()
| last_evaluation | number_project | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | overworked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
| 1 | 0.86 | 5 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 2 | 0.88 | 7 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 3 | 0.87 | 5 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 1 |
| 4 | 0.52 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
Again, I'll isolate the features and target variables.
# Isolating the outcome variable
y = df2['left']
# Selecting the features
X = df2.drop('left', axis=1)
Splitting the data into training and testing sets.
# Creating the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
Decision Tree - Round 2
# Instantiating the model
tree = DecisionTreeClassifier(random_state=0)
# Assigning a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assigning a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
# Instantiating GridSearch
tree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
%%time
tree2.fit(X_train, y_train)
Wall time: 2.56 s
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
param_grid={'max_depth': [4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]},
refit='roc_auc',
scoring={'recall', 'precision', 'roc_auc', 'f1', 'accuracy'})
# Checking the best params
tree2.best_params_
{'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Checking the best AUC score on CV
tree2.best_score_
0.9586752505340426
Observation:
- This model performs very well, even without satisfaction levels and detailed hours worked data.
Next, I'll check the other scores.
# Getting all CV scores
tree2_cv_results = make_results('decision tree2 cv', tree2, 'auc')
print(tree1_cv_results)
print(tree2_cv_results)
model precision recall F1 accuracy auc
0 decision tree cv 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
Observations:
- Some of the other scores fell. That's to be expected given fewer features were taken into account in this round of the model. Still, the scores are very good.
Random Forest - Round 2
# Instantiating the model
rf = RandomForestClassifier(random_state=0)
# Assigning a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assigning a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
# Instantiating GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc')
%%time
rf2.fit(X_train, y_train) # --> Wall time: 7min
Wall time: 10min 8s
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0),
param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3, 4],
'n_estimators': [300, 500]},
refit='roc_auc',
scoring={'recall', 'precision', 'roc_auc', 'f1', 'accuracy'})
# Writing pickle
write_pickle(path, rf2, 'hr_rf2')
Pickle file saved at: C:/Users/cneez/Python Code\hr_rf2.pickle
# Reading in pickle
rf2 = read_pickle(path, 'hr_rf2')
Pickle file loaded from: C:/Users/cneez/Python Code\hr_rf2.pickle
# Checking the best params
rf2.best_params_
{'max_depth': 5,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 2,
'min_samples_split': 2,
'n_estimators': 300}
# Checking the best AUC score on CV
rf2.best_score_
0.9648100662833985
# Getting all round 2 CV scores
rf2_cv_results = make_results('random forest2 cv', rf2, 'auc')
print(tree2_cv_results)
print(rf2_cv_results)
model precision recall F1 accuracy auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
model precision recall F1 accuracy auc
0 random forest2 cv 0.866758 0.878754 0.872407 0.957411 0.96481
Observation:
- Again, the scores dropped slightly, but the random forest performs better than the decision tree if using AUC as the deciding metric.
Scoring the champion model on the test set.
# Getting predictions on test data
rf2_test_scores = get_scores('random forest2 test', rf2, X_test, y_test)
rf2_test_scores
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | random forest2 test | 0.870406 | 0.903614 | 0.8867 | 0.961641 | 0.938407 |
This seems to be a stable, well-performing model.
Next, I'll plot a confusion matrix to visualize how well the model predicts on the test set.
# Generating an array of values for the confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)
# Plotting confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf2.classes_)
disp.plot(values_format='');
Observations:
- The model predicts more false positives than false negatives, which means that some employees may be identified as at risk of quitting or getting fired, when that's actually not the case. But this is still a strong model.
- For exploratory purposes, I'll inspect the splits of the decision tree model and the most important features in the random forest model.
Decision Tree Splits
# Plotting the tree
plt.figure(figsize=(85,20))
plot_tree(tree2.best_estimator_, max_depth=6, fontsize=14, feature_names=X.columns,
          class_names=['stayed', 'left'], filled=True);
plt.show()
Decision Tree Feature Importance
Now, I'll plot the feature importance for the decision tree model.
#tree2_importances = pd.DataFrame(tree2.best_estimator_.feature_importances_, columns=X.columns)
tree2_importances = pd.DataFrame(tree2.best_estimator_.feature_importances_,
columns=['gini_importance'],
index=X.columns
)
tree2_importances = tree2_importances.sort_values(by='gini_importance', ascending=False)
# Only extracting the features with importances > 0
tree2_importances = tree2_importances[tree2_importances['gini_importance'] != 0]
tree2_importances
| gini_importance | |
|---|---|
| last_evaluation | 0.343958 |
| number_project | 0.343385 |
| tenure | 0.215681 |
| overworked | 0.093498 |
| department_support | 0.001142 |
| salary | 0.000910 |
| department_sales | 0.000607 |
| department_technical | 0.000418 |
| work_accident | 0.000183 |
| department_IT | 0.000139 |
| department_marketing | 0.000078 |
Next, I'll create a barplot to visualize the decision tree feature importance.
sns.barplot(data=tree2_importances, x="gini_importance", y=tree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importance for Employees Leaving", fontsize=12)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()
Observations:
- The barplot above shows that in this decision tree model,
last_evaluation,number_project,tenure, andoverworkedhave the highest importance, in that order. These variables are most helpful in predicting the outcome variable,left.
Random Forest Feature Importance
Next, I'll plot the feature importance for the random forest model.
# Getting feature importances
feat_impt = rf2.best_estimator_.feature_importances_
# Getting indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]
# Getting column labels of top 10 features
feat = X.columns[ind]
# Filtering `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]
y_df = pd.DataFrame({"Feature":feat,"Importance":feat_impt})
y_sort_df = y_df.sort_values("Importance")
fig = plt.figure()
ax1 = fig.add_subplot(111)
y_sort_df.plot(kind='barh',ax=ax1,x="Feature",y="Importance")
ax1.set_title("Random Forest: Feature Importances for Employee Leaving", fontsize=12)
ax1.set_ylabel("Feature")
ax1.set_xlabel("Importance")
plt.show()
Observations:
- The plot above shows that in this random forest model,
last_evaluation,number_project,tenure, andoverworkedhave the highest importance, in that order. These variables are most helpful in predicting the outcome variable,left, and they are the same as the ones used by the decision tree model.
PACE: Execute¶
Within the execute phase of the PACE framework I will:
- Interpret model performance and results
- Share actionable steps with stakeholders
Recalling evaluation metrics¶
- AUC is the area under the ROC curve; it's also considered the probability that the model ranks a random positive example more highly than a random negative example.
- Precision measures the proportion of data points predicted as True that are actually True, in other words, the proportion of positive predictions that are true positives.
- Recall measures the proportion of data points that are predicted as True, out of all the data points that are actually True. In other words, it measures the proportion of positives that are correctly classified.
- Accuracy measures the proportion of data points that are correctly classified.
- F1-score is the harmonic mean of precision and recall, so it balances the two metrics.
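The AUC interpretation above can be sanity-checked with a small, made-up example: the fraction of (positive, negative) pairs in which the positive example receives the higher score matches `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration only
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs where the positive outranks the negative
# (ties count as half a win)
pairwise = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg)) / (len(pos) * len(neg))
print(pairwise, roc_auc_score(y_true, y_score))  # the two values agree
```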
Step 4. Results and Evaluation¶
Within the next steps I'll complete the following:
- Interpret model
- Evaluate model performance using metrics
- Prepare results, visualizations, and actionable steps to share with stakeholders
Summary of model results¶
Logistic Regression
The logistic regression model achieved precision of 79%, recall of 82%, f1-score of 80% (all weighted averages), and accuracy of 82% on the test set.
Tree-based Machine Learning
After conducting feature engineering, the random forest model achieved AUC of 93.8%, precision of 87.0%, recall of 90.4%, f1-score of 88.7%, and accuracy of 96.2% on the test set, modestly outperforming the decision tree model.
Conclusion, Recommendations, Next Steps¶
The models and the feature importances extracted from the models confirm that employees at the company are overworked.
To retain employees, the following recommendations could be presented to the stakeholders:
- Cap the number of projects that employees can work on.
- Consider promoting employees who have been with the company for at least four years, or conduct further investigation about why four-year tenured employees are so dissatisfied.
- Either reward employees for working longer hours, or don't require them to do so.
- If employees aren't familiar with the company's overtime pay policies, inform them about this. If the expectations around workload and time off aren't explicit, make them clear.
- Hold company-wide and within-team discussions to understand and address the company work culture, across the board and in specific contexts.
- High evaluation scores should not be reserved for employees who work 200+ hours per month. Consider a proportionate scale for rewarding employees who contribute more/put in more effort.
Next Steps
Some concern about data leakage may still be justified. It could be prudent to consider how predictions change when `last_evaluation` is removed from the data. It's possible that evaluations aren't performed very frequently, in which case it would be useful to be able to predict employee retention without this feature. It's also possible that the evaluation score determines whether an employee leaves or stays, in which case it could be useful to pivot and try to predict performance scores. The same could be said for satisfaction scores.
For another project, I could try building a K-means model on this data and analyzing the clusters. This may yield valuable insight.
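A sketch of what that follow-up could look like, using synthetic stand-in columns (the real analysis would use the scaled HR features):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a few HR-style features (illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'satisfaction_level': rng.uniform(0, 1, 200),
    'average_monthly_hours': rng.integers(96, 310, 200),
    'tenure': rng.integers(2, 10, 200),
})

# Scale first: k-means is distance-based, so unscaled hours would dominate
X_scaled = StandardScaler().fit_transform(demo)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Characterize each cluster by its per-feature means
demo['cluster'] = kmeans.labels_
print(demo.groupby('cluster').mean())
```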